ABSTRACT

In this study we analyze the impact of air pollution on day-to-day life. The main focus
of this contribution is modeling data with supervised algorithms, namely Linear Regression
(regression) and Logistic Regression (classification), and examining the consequences. The
analysis covers the impact of air pollution in India and the US and the factors that affect
the Air Quality Index (AQI). The dataset comprises pollutant concentrations, each of which
is needed to calculate the AQI, so the index is computed during the process and then used
in the analysis. We also examine combinations of independent variables (interaction effects)
and their impact on the dependent variable and on model accuracy, as well as correlation
between independent variables (multicollinearity) and its adverse effect on the dependent
variable and on the fitted model. Solutions to the problem of multicollinearity, namely
Regularization and Stepwise Regression, are also discussed in this kernel.
I. Introduction

As the world advances in technology and resources, we human beings are neglecting
nature's miracles, and as a result we are building a model for our own destruction.
This paper focuses on a natural resource whose value we generally fail to recognize:
the air. Air and water are the two main resources without which no animal, insect or
human being can survive [1]. This paper therefore examines the quality of air in India.
Air pollution from factories and from vehicles running on petrol and diesel is
increasing day by day [2]. We have therefore analyzed datasets for the two countries,
extracted from "Kaggle" and "GitHub", and built a regression model to determine how
much pollution has increased, how, and from what cause [3][4].

II. Exploratory Data Analysis

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize
their main characteristics, often with visual methods [5]. A statistical model may or
may not be used, but primarily EDA is for seeing what the data can tell us beyond the
formal modeling or hypothesis testing task [6][7][8]. As this is the first step of the
analysis, we imported a number of libraries to support the program and the study [9].
In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
# Imputer was removed in scikit-learn 0.22; newer versions provide
# sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import Imputer
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 7)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
import statsmodels.formula.api as sm
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn import metrics
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings; warnings.simplefilter('ignore')

Fig 1: Libraries Imported


These libraries support the analysis and demonstration throughout, and help us
understand the model generated by regression [10]. After extracting the dataset we
generated a bar graph to visualize the AQI in different states of India. The graph
shows the normal AQI of Indian states and includes most of the major cities in the
country. It reads simply: the higher the value, the more polluted the city [11].

The scale of this city AQI graph extends beyond 200, which is not a good indication
for the environment. Respiratory diseases and some skin allergies are more likely to
occur in those regions [12][13]. We therefore predict the index with the help of the
ranges given by the respective governments of India and the United States [14-22].

Having analyzed the graph of cities, we now look at the state-wise AQI in the country.


Fig  2:  AQI  Values  Ranges 


Fig  3:  AQI  Values  Ranges  (States) 
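State-wise values of this kind can be obtained by averaging city-level readings per state. The following is a minimal sketch of that aggregation with pandas; the column names and the readings are made up for illustration and are not taken from the paper's dataset.

```python
import pandas as pd

# Hypothetical city-level readings; names and values are illustrative only.
readings = pd.DataFrame({
    "State": ["Bihar", "Bihar", "Assam", "Kerala", "Kerala"],
    "AQI": [120.0, 140.0, 200.0, 60.0, 64.0],
})

# Average the AQI within each state, most polluted state first.
state_aqi = (
    readings.groupby("State")["AQI"]
    .mean()
    .sort_values(ascending=False)
)
print(state_aqi)
```

A bar graph like the ones in Fig 2 and Fig 3 could then be drawn from `state_aqi` with any plotting library.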


III. Methodology
Data Cleaning & AQI Values

The dataset, as received from the respective files, contained several null values in
its rows and columns. Proceeding with these null values would introduce disturbance
and unevenness into the model and our results.

The best way to handle null values is to use a measure of central tendency [23-26].
We extracted the column and row data, calculated the mean value of each feature
present in the dataset, and filled the null entries with that mean. This is the
process of null value handling, or data handling.
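The mean-filling step described above can be sketched as follows. This is an illustration only: the paper's code is not shown, so the helper name `fill_nulls_with_mean`, the column names and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the paper's schema.
df = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "PM2.5": [60.0, np.nan, 80.0, 100.0],
    "NO2": [20.0, 30.0, np.nan, 40.0],
})

def fill_nulls_with_mean(frame):
    """Return a copy with NaNs in numeric columns replaced by the column mean."""
    out = frame.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # DataFrame.mean() skips NaN by default, so each mean uses observed values only.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out

clean = fill_nulls_with_mean(df)
print(clean)
```

Mean imputation keeps every row but shrinks the variance of the filled columns, which is worth keeping in mind when reading the regression results.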
     City                 State
0    Amaravati            Andhra Pradesh
1    Rajamahendravaram    Andhra Pradesh
2    Tirupati             Andhra Pradesh
3    Visakhapatnam        Andhra Pradesh
4    Guwahati             Assam
5    Gaya                 Bihar
6    Gaya                 Bihar
7    Haripur              Bihar
8    Muzaffarpur          Bihar
9    Patna                Bihar

Fig 4: Null Values

These are the cities, with their respective states, that have null values. Handling
null values is very important because they can behave abnormally and disrupt the
visualization.

State             Current AQI value
Andhra Pradesh     53.150538
Assam             198.760000
Bihar             130.773684
Chandigarh         44.640000
Delhi             106.601542
Gujarat           102.522727
Haryana            86.364929
Jharkhand         136.600000
Karnataka          64.952941
Kerala             61.555556
Madhya Pradesh     86.573407
Maharashtra        81.536957
Meghalaya          81.833333
Mizoram            64.550000
Odisha            129.321429
Punjab             61.230337
Rajasthan          80.690678
Tamil Nadu         62.422222
Telangana          71.360825
Uttar Pradesh     115.253385
West Bengal        95.607692

Fig 5: State AQI Values


ALGORITHMS & MODELS
LINEAR REGRESSION

Linear regression is a linear approach to modeling the relationship between a scalar
response (the dependent variable), here the AQI, and one or more explanatory
variables (the independent variables), here the pollutant concentrations.

Model-1

Fig 8: LR Model-1

Model-2

Fig 6: LR Model-2

Model-3

Fig 7: LR Model-3
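As an illustration of fitting a linear regression of a response on several explanatory variables, the sketch below uses scikit-learn on synthetic data. The three feature columns, the coefficients and the noise level are assumptions chosen for the example; they stand in for pollutant concentrations and AQI and are not the paper's data or models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for pollutant concentrations (three hypothetical columns).
X = rng.uniform(0, 100, size=(500, 3))
true_coef = np.array([0.5, 0.3, 0.2])  # assumed, for illustration only
y = X @ true_coef + rng.normal(0, 1.0, size=500)  # synthetic "AQI"

# Hold out 20% of the rows to check the fit on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print("coefficients:", model.coef_)
print("test R-squared:", r2)
```

Scoring on a held-out split rather than the training rows gives a more honest estimate of how the model would predict new readings.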


89  International  Journal  for  Modern  Trends  in  Science  and  Technology 



Debjyoti  Saha  and  Shashikant  Patil,  " Analysis  &  Demonstration  of  Impact  of  Air  Pollution" 


LOGISTIC REGRESSION

Model-1

Fig 8: Logistic Model-1 (x-axis: False Positive Rate)

Model-2

Fig 9: Logistic Model-2

Model-3

Fig 10: Logistic Model-3 (x-axis: False Positive Rate, 0.0 to 1.0)
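Plots with a False Positive Rate axis are typically ROC curves, so the figures above are assumed here to be ROC curves of the logistic models. The sketch below shows how such a curve can be computed with scikit-learn; the two synthetic features and the labelling rule are assumptions made for the example, not the paper's `type_label` data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic two-feature data; the label depends on the first feature plus noise.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Points of the ROC curve: false positive rate against true positive rate.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print("AUC:", auc)
```

Plotting `fpr` against `tpr` gives the curve; an area under the curve near 0.5 means the classifier is little better than chance.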

IV. RESULTS & CONCLUSIONS

The result tables summarize the above models. They indicate which model gives the
most accurate predictions and can thus guide remedies for this pollution. We have
discussed and addressed various issues related to air pollution and air quality, the
use of statistical models, the selection of a suitable model, and the effect of
varying different parameters. In future we will use neural networks and other
machine learning approaches to advance these studies and improve the outcome.


OLS Regression Results

Dep. Variable:      AQI                 R-squared:            0.998
Model:              OLS                 Adj. R-squared:       0.998
Method:             Least Squares       F-statistic:          3.605e+07
Date:               Sat, 02 Mar 2019    Prob (F-statistic):   0.00
Time:               04:43:16            Log-Likelihood:       -1.2978e+06
No. Observations:   348588              AIC:                  2.596e+06
Df Residuals:       348584              BIC:                  2.596e+06
Df Model:           4
Covariance Type:    nonrobust

Omnibus:            472975.883          Durbin-Watson:        2.003
Prob(Omnibus):      0.000               Jarque-Bera (JB):     24702?613.534 (one digit illegible in the source)
Skew:               7.551               Prob(JB):             0.00
Kurtosis:           132.536             Cond. No.             24.7

Fig 11: Least Square Results
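The information criteria in this summary can be cross-checked against the log-likelihood using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The sketch below does that arithmetic. It assumes the log-likelihood is negative (-1.2978e+06) and takes k = 4 from the Df Model entry; both are readings of the printed output, not values confirmed by the authors.

```python
import math

# Values read from the OLS summary (Fig 11). The negative sign on the
# log-likelihood and k = 4 fitted parameters are assumptions.
log_likelihood = -1.2978e6
n_obs = 348588
k_params = 4

aic = 2 * k_params - 2 * log_likelihood
bic = k_params * math.log(n_obs) - 2 * log_likelihood
print(f"AIC = {aic:.3e}, BIC = {bic:.3e}")
```

Both values round to 2.596e+06, in agreement with the AIC and BIC entries of the table. With a log-likelihood of this magnitude the penalty terms are negligible, so the two criteria are practically identical.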
Model Summary

Results: Logit

Model:                Logit               Pseudo R-squared:   0.017
Dependent Variable:   type_label          AIC:                583668.1702
Date:                 2019-03-02 04:45    BIC:                583778.0181
No. Observations:     435735              Log-Likelihood:     -2.9182e+05
Df Model:             9                   LL-Null:            -2.9687e+05
Df Residuals:         435725              LLR p-value:        0.0000
Converged:            1.0000              Scale:              1.0000
No. Iterations:       5.0000

Fig 12: Logit Model Results
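The pseudo R-squared reported for the logit model can be reproduced from the two log-likelihoods in the summary, assuming it is McFadden's measure, 1 - LL / LL-Null (the default reported by statsmodels for Logit). The sketch below only redoes that arithmetic from the printed values.

```python
# Log-likelihoods as printed in the Logit summary (Fig 12).
ll_model = -2.9182e5  # fitted model
ll_null = -2.9687e5   # intercept-only model

# McFadden's pseudo R-squared (assumed to be the measure reported).
pseudo_r2 = 1.0 - ll_model / ll_null

# Likelihood-ratio statistic comparing the fitted model with the null model.
lr_stat = 2.0 * (ll_model - ll_null)

print(f"pseudo R-squared = {pseudo_r2:.3f}, LR statistic = {lr_stat:.0f}")
```

The result rounds to 0.017, matching the table. The fitted model improves on the intercept-only model by a statistically significant margin (hence the LLR p-value of 0.0000), but explains only a small share of the variation in the label.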